Introduction to Medical Statistics 2024
Exercises Class II
Exploratory Data Analysis

Author: Ronald Geskus

Published: August 19, 2024

Again, first copy the R code to your work file (script, notebook, markdown) before running.

We will investigate relationships between variables, using the same data set cmTbmData.csv as this morning. We first import the data set again.
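A minimal sketch of the import step, assuming the file sits in the current working directory and that base R's read.csv is used. The object name cmTbm is a placeholder; keep whatever name you used this morning.

```r
# Assumption: cmTbmData.csv is in the working directory (check with getwd())
dataFile <- "cmTbmData.csv"

if (file.exists(dataFile)) {
  # cmTbm is a placeholder name for the imported data frame
  cmTbm <- read.csv(dataFile)
  str(cmTbm)  # variable names and types
} else {
  message("Cannot find ", dataFile, " in ", getwd())
}
```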

I. Baseline table

We will start with some numerical summaries.

  1. Summarize the variables age, white cell count in blood (bldwcc), white cell count in cerebrospinal fluid (csfwcc) and sex by patient group (variable groupLong) in a nice tabular format. Use the tbl_summary function from the gtsummary package. First load the gtsummary package into your R session. Specify the arguments data, by and include. How many patients are in each group? Are there any missing values for these variables? What do the numbers represent?
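The call could look roughly as follows. This is a sketch on simulated stand-in data (the data frame toy, its group labels and its sex coding are invented for illustration); with the course data, pass your imported data frame to the data argument instead.

```r
library(gtsummary)

set.seed(2024)
n <- 80
# Simulated stand-in data, NOT the course data; column names follow the exercise text
toy <- data.frame(
  groupLong = sample(paste("Group", 1:4), n, replace = TRUE),
  sex       = sample(c("Female", "Male"), n, replace = TRUE),
  age       = sample(18:80, n, replace = TRUE),
  bldwcc    = rlnorm(n, meanlog = 2, sdlog = 0.5),
  csfwcc    = rlnorm(n, meanlog = 4, sdlog = 1.5)
)
toy$csfwcc[1:5] <- NA  # a few missing values, so that a row for missing data appears

baseTab <- tbl_summary(
  data    = toy,
  by      = groupLong,                     # one column per patient group
  include = c(age, bldwcc, csfwcc, sex)    # variables shown as rows
)
baseTab
```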

We change and polish the table a bit. We add the units of age and white cell count. For this we use the label argument. We additionally report mean and standard deviation for age and for both CSF and blood white cell count via the statistic argument. We change the label “Unknown” into “missing” via the missing_text argument. With binary variables like sex, it is usually sufficient to show only one of the two levels, because the value for the other level is simply the remainder. Hence, we remove one of the values in sex. Run the code and have a look at the results. We made one error which you will notice when looking at the output; please correct it. Does the distribution of the variables clearly differ by patient group? Can you tell more about the skewness of the three numeric variables?
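To show how these arguments fit together, here is a sketch on simulated stand-in data. It is not the course code and does not contain the deliberate error. The data frame toy, the level "Male" and the label text are assumptions; the white cell count labels are left out because their units should come from the data dictionary.

```r
library(gtsummary)

set.seed(2024)
n <- 80
# Simulated stand-in data, NOT the course data
toy <- data.frame(
  groupLong = sample(paste("Group", 1:4), n, replace = TRUE),
  sex       = sample(c("Female", "Male"), n, replace = TRUE),
  age       = sample(18:80, n, replace = TRUE),
  bldwcc    = rlnorm(n, meanlog = 2, sdlog = 0.5),
  csfwcc    = rlnorm(n, meanlog = 4, sdlog = 1.5)
)
toy$csfwcc[1:5] <- NA

polishedTab <- tbl_summary(
  data    = toy,
  by      = groupLong,
  include = c(age, bldwcc, csfwcc, sex),
  # add labels with units for bldwcc and csfwcc in the same way
  label   = list(age ~ "Age (years)"),
  # median (quartiles) followed by mean (SD) for all continuous variables
  statistic = list(all_continuous() ~ "{median} ({p25}, {p75}); {mean} ({sd})"),
  # show a single row for sex; "Male" is the assumed coding in the toy data
  type    = list(sex ~ "dichotomous"),
  value   = list(sex ~ "Male"),
  missing_text = "missing"
)
polishedTab
```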

II. Visualization of CSF white cell count by subgroup

We will switch to a graphical display of CSF white cell count, split by patient group.

  1. Draw a boxplot of CSF white cell count for each of the four patient groups. Add the raw data points to the plot. Do this for both the original values and the log-transformed CSF values that we created in the morning. What information do these plots give you with respect to skewness of the variables? How does the distribution vary by patient group? Relate the figures to the numerical summaries that you made earlier. Can you explain the warning messages?
  1. Make a frequency polygon of the log-transformed CSF white cell count for each of the four patient groups separately, but plotted on top of each other (this is not available in the ggplot2 GUIs). Distinguish between the four groups by colour. A density plot is a graphical summary that can be seen as a “smoothed” version of a histogram. Instead of the binwidth in a histogram, we now specify the level of detail via a bandwidth parameter. Make the density plots for each of the four patient groups separately, but again plotted on top of each other. We make the colours slightly transparent via the alpha argument. See what happens if you change the arguments alpha and adjust in the density plot. Which plot type do you prefer?
  1. A violin plot is an alternative to the boxplot that summarizes data in a similar fashion to the density plot. Summarize CSF white cell count by patient group in a violin plot. Give each of the four groups a different colour. This time, make a separate panel (“facet”) for males and females, next to each other. Do you see any difference between males and females?
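For the boxplot, frequency polygon and density plot in the first two exercises above, the ggplot2 calls might be sketched as follows. The data are simulated stand-ins, and logCsfwcc is a placeholder name for the log-transformed variable created in the morning.

```r
library(ggplot2)

set.seed(2024)
n <- 160
# Simulated stand-in data, NOT the course data
toy <- data.frame(
  groupLong = sample(paste("Group", 1:4), n, replace = TRUE),
  csfwcc    = rlnorm(n, meanlog = 4, sdlog = 1.5)
)
toy$logCsfwcc <- log10(toy$csfwcc)  # placeholder name for the log-transformed values

# Boxplot plus raw points; the boxplot's own outlier symbols are hidden
# so that those observations are not drawn twice
pBox <- ggplot(toy, aes(x = groupLong, y = logCsfwcc)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, height = 0, alpha = 0.5)

# Frequency polygons, one line per group, overlaid
pFreq <- ggplot(toy, aes(x = logCsfwcc, colour = groupLong)) +
  geom_freqpoly(binwidth = 0.25)

# Density plots, overlaid; alpha sets transparency, adjust scales the bandwidth
pDens <- ggplot(toy, aes(x = logCsfwcc, fill = groupLong, colour = groupLong)) +
  geom_density(alpha = 0.3, adjust = 1)

pBox
pFreq
pDens
```

Replacing y = logCsfwcc by y = csfwcc in the first plot gives the boxplot on the original scale.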

Make the same violin plot, but now add the individual values.
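A possible shape for both violin plots, again on simulated stand-in data. Plotting the log10 values rather than the original ones is an assumption of this sketch, as are the data frame toy and the name logCsfwcc.

```r
library(ggplot2)

set.seed(2024)
n <- 160
# Simulated stand-in data, NOT the course data
toy <- data.frame(
  groupLong = sample(paste("Group", 1:4), n, replace = TRUE),
  sex       = sample(c("Female", "Male"), n, replace = TRUE),
  csfwcc    = rlnorm(n, meanlog = 4, sdlog = 1.5)
)
toy$logCsfwcc <- log10(toy$csfwcc)

# Violin per patient group, coloured by group, one panel per sex (side by side)
pViolin <- ggplot(toy, aes(x = groupLong, y = logCsfwcc, fill = groupLong)) +
  geom_violin() +
  facet_wrap(~ sex) +
  theme(legend.position = "none")  # the x-axis already names the groups

# Same plot with the individual values added on top
pViolinPts <- pViolin +
  geom_jitter(width = 0.1, height = 0, size = 0.8)

pViolin
pViolinPts
```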

  1. Make a raincloud plot instead of the violin plot. Which plot type do you prefer?
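One way to build a raincloud plot is to combine a half density, a narrow boxplot and a dot plot. The sketch below uses the ggdist package for this, which is an assumption; the course may use a different package. The data are simulated stand-ins.

```r
library(ggplot2)
library(ggdist)

set.seed(2024)
n <- 160
# Simulated stand-in data, NOT the course data
toy <- data.frame(
  groupLong = sample(paste("Group", 1:4), n, replace = TRUE),
  csfwcc    = rlnorm(n, meanlog = 4, sdlog = 1.5)
)
toy$logCsfwcc <- log10(toy$csfwcc)

pRain <- ggplot(toy, aes(x = groupLong, y = logCsfwcc, fill = groupLong)) +
  # the "cloud": half density, shifted to the right, without interval or point
  stat_halfeye(adjust = 1, justification = -0.2, .width = 0, point_colour = NA) +
  # narrow boxplot in the middle
  geom_boxplot(width = 0.12, outlier.shape = NA, alpha = 0.5) +
  # the "rain": individual values as dots on the left
  stat_dots(side = "left", justification = 1.1) +
  theme(legend.position = "none")

pRain
```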

III. Relation between blood and CSF white cell count

  1. CSF white cell count is harder to measure than white cell count in blood. It would be great if we could measure blood wcc to obtain an idea of CSF wcc. Make a scatterplot of CSF white cell count (y-axis) against blood white cell count (x-axis). Use an appropriate transformation for both (you can make histograms to find the best one). Do you observe a relation between both? Would it be feasible to predict CSF white cell count based on white cell count in blood?
  1. Quantify the strength of the relation via an appropriate correlation. Try both the Pearson and Spearman rank correlation, and both with the original CSF WCC as well as the log-transformed values. Use the original values for blood WCC. What do you observe?
  1. Repeat the scatterplot from the first exercise of this section, but now split up by sex and patient group. What additional information can we obtain from this figure?
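The three exercises above could be approached along these lines. The sketch uses simulated stand-in data in which blood and CSF counts are generated independently, so it says nothing about the real relation. The log10 axes are one candidate transformation, not necessarily the best one for the course data.

```r
library(ggplot2)

set.seed(2024)
n <- 160
# Simulated stand-in data, NOT the course data
toy <- data.frame(
  groupLong = sample(paste("Group", 1:4), n, replace = TRUE),
  sex       = sample(c("Female", "Male"), n, replace = TRUE),
  bldwcc    = rlnorm(n, meanlog = 2, sdlog = 0.5),
  csfwcc    = rlnorm(n, meanlog = 4, sdlog = 1.5)
)

# Scatterplot with both axes on a log10 scale
pScatter <- ggplot(toy, aes(x = bldwcc, y = csfwcc)) +
  geom_point(alpha = 0.6) +
  scale_x_log10() +
  scale_y_log10()

# Pearson and Spearman correlation, blood WCC on the original scale,
# CSF WCC on the original and on the log10 scale;
# use = "complete.obs" drops rows with a missing value in either variable
corTab <- sapply(c(pearson = "pearson", spearman = "spearman"), function(m) {
  c(original = cor(toy$bldwcc, toy$csfwcc, method = m, use = "complete.obs"),
    logCsf   = cor(toy$bldwcc, log10(toy$csfwcc), method = m, use = "complete.obs"))
})
corTab

# Same scatterplot, one panel per combination of sex (rows) and group (columns)
pSplit <- pScatter + facet_grid(sex ~ groupLong)

pScatter
pSplit
```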

IV. Graphical summary of several variables at once

  1. Make a pairwise summary plot for the variables age, log10 of CSF WCC and sex using the command below. What type of summaries do you see? What do you conclude with respect to the three variables?
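A pairwise summary plot of this kind can be produced with ggpairs from the GGally package. The sketch below is an assumption about the approach, not necessarily the original course command, and it runs on simulated stand-in data with the placeholder name logCsfwcc.

```r
library(ggplot2)
library(GGally)

set.seed(2024)
n <- 160
# Simulated stand-in data, NOT the course data
toy <- data.frame(
  sex    = sample(c("Female", "Male"), n, replace = TRUE),
  age    = sample(18:80, n, replace = TRUE),
  csfwcc = rlnorm(n, meanlog = 4, sdlog = 1.5)
)
toy$logCsfwcc <- log10(toy$csfwcc)

# One panel per pair of variables; the panel type depends on the
# combination of variable types (numeric or categorical)
pPairs <- ggpairs(toy, columns = c("age", "logCsfwcc", "sex"))
pPairs
```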